[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models#34695
MatthewBonanni merged 3 commits into vllm-project:main from
Conversation
…-project#34561) Fix AttributeError when using AWQ/GPTQ quantized MLA models (e.g., GLM-4.7-Flash-AWQ) by guarding `kv_b_proj.weight.dtype` accesses with `hasattr` checks and falling back to `params_dtype`. Signed-off-by: haosdent <haosdent@gmail.com>
Code Review
This pull request effectively addresses a crash that occurs when running MLA models with AWQ/GPTQ quantization. The root cause, an AttributeError from accessing a non-existent .weight attribute on quantized layers, is correctly identified. The fix is clean and robust, using hasattr guards to prevent the error. The fallback to params_dtype for quantized layers in _compute_prefill_context is a logical and well-justified approach. The changes are minimal, targeted, and well-documented in the pull request description. Overall, this is an excellent bugfix.
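The failure mode the review describes can be reproduced with a small stand-in class (hypothetical, not vLLM's actual quantized layer): AWQ/GPTQ layers store packed int32 weights under `qweight`, so any access to `.weight` raises.

```python
# Minimal reproduction sketch (hypothetical GPTQLinear stand-in):
# AWQ/GPTQ layers expose packed weights as `qweight`, not `weight`,
# so `.weight.dtype` raises AttributeError.
class GPTQLinear:
    qweight = "packed int32 data"  # no .weight attribute

layer = GPTQLinear()
try:
    _ = layer.weight.dtype
except AttributeError as e:
    print(e)  # 'GPTQLinear' object has no attribute 'weight'
```

This is why the fix guards each access with `hasattr` rather than assuming `.weight` exists.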
cc @LucasWilkinson @MatthewBonanni @pavanimajety to review
I've checked that with this PR the deepseek v3 AWQ checkpoint loads successfully and gets normal accuracy (gsm8k: 0.945). The change looks accurate and simple; can we review this PR to unblock AWQ/GPTQ models from running on latest main with transformers v5?
Thanks for @cjackal's test. @LucasWilkinson @MatthewBonanni @pavanimajety, can you help review? Thank you in advance.
I also verified that this patch fixes the error when serving GLM-4.7-Flash-GPTQ-4bits, with both single and batch requests.
MatthewBonanni left a comment
LGTM, thanks for the contribution!
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com>
Restore three upstream changes in MLACommonImpl that were accidentally removed in initial AITER commits:

1. Add back `logger.info_once` for backend selection (TRT-LLM, FlashInfer, CUDNN, FlashAttention), helpful for debugging
2. Restore FA4 support in `_pad_v` logic: FA4 natively handles different head dimensions, like FA3 on Hopper
3. Restore `params_dtype` fallback for AWQ/GPTQ quantized models (PR vllm-project#34695): quantized layers may lack the `.weight` attribute

These changes are in MLACommonImpl (the shared backend selector), not related to the AITER fused kernel functionality, which is in the MLAAttention class. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…-project#34695) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
Fix `AttributeError: 'ColumnParallelLinear' object has no attribute 'weight'` when running MLA models with AWQ/GPTQ quantization (e.g., cyankiwi/GLM-4.7-Flash-AWQ-4bit). Closes #34561.
Root cause: MLA attention code accesses `self.kv_b_proj.weight.dtype` in 3 places, but AWQ/GPTQ-quantized `ColumnParallelLinear` layers store weights as `qweight` (packed int32), not `weight`. The code only accounted for unquantized and FP8-quantized weights.

Fix: Guard `.weight.dtype` accesses with `hasattr(self.kv_b_proj, "weight")` checks:

- `MLAAttention.__init__`: Added a `hasattr` guard in the `and` chain for the ROCm fp4 BMM check. It short-circuits to `False` for AWQ/GPTQ, which is correct since packed int32 weights can't be used with fp4 BMM.
- `MLACommonImpl._compute_prefill_context`: Introduced a local `_kv_b_proj_w_dtype` that uses `weight.dtype` when available, falling back to `params_dtype` (always present on `LinearBase`) for quantized layers. `params_dtype` is the model's compute dtype (e.g., bf16), which is the correct input dtype that AWQ/GPTQ layers expect.

Correctness verified across all quantization methods:
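The two guards can be sketched as follows, with hypothetical stand-in classes (not vLLM's actual `LinearBase`/`ColumnParallelLinear`; dtypes are modeled as strings for brevity):

```python
# Sketch of the guard patterns described above (hypothetical stand-ins).
class UnquantizedLinear:
    """Layer with a real .weight tensor; dtype modeled as a string."""
    params_dtype = "bfloat16"
    weight = type("Tensor", (), {"dtype": "float16"})()

class AWQLinear:
    """AWQ/GPTQ-style layer: packed qweight only, no .weight."""
    params_dtype = "bfloat16"
    qweight = "packed int32 data"

def kv_b_proj_w_dtype(layer):
    # Fallback pattern: prefer weight.dtype when present, otherwise use
    # params_dtype, which is always set and matches the compute dtype.
    return layer.weight.dtype if hasattr(layer, "weight") else layer.params_dtype

def fp4_bmm_eligible(layer):
    # Short-circuit pattern: hasattr runs first in the `and` chain, so a
    # quantized layer yields False instead of raising AttributeError.
    return hasattr(layer, "weight") and layer.weight.dtype == "fp4"

print(kv_b_proj_w_dtype(UnquantizedLinear()))  # float16
print(kv_b_proj_w_dtype(AWQLinear()))          # bfloat16
print(fp4_bmm_eligible(AWQLinear()))           # False
```

Both paths avoid touching `.weight` unless it exists, which is the whole of the fix.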
Test Plan

Tests cover the `hasattr` guards and 1 `params_dtype` fallback, mocking `ColumnParallelLinear` layer behavior.

Test Result